Submitted by:
| # | Name | ID | Email |
|---|------|----|-------|
| Student 1 | Ilay Kamai | 305434896 | ilay.kamai@campus.technion.ac.il |
| Student 2 | Yanai Soker | 205733421 | yanay.soker@campus.technion.ac.il |
In this first homework assignment we'll familiarize ourselves with PyTorch as a
general-purpose tensor library with automatic gradient calculation capabilities.
We'll use it to implement some traditional machine-learning algorithms and remind ourselves about
basic machine-learning concepts such as different data sets and their uses, model hyperparameters, cross-validation,
loss functions and gradient derivation. We'll also familiarize ourselves with other highly useful
Python machine-learning packages such as numpy, sklearn and pandas.
The code for each assignment resides in a correspondingly-named Python package (hw1, hw2, etc).
You can of course use any editor or IDE to work on these files.
## Datasets in PyTorch

In this part, we'll learn about the Dataset and DataLoader classes, which are part of PyTorch's torch.utils.data package.
These are highly useful abstractions that can greatly reduce the amount of boilerplate code you need to write in order to work with data.
Knowing how to use these classes properly will prove useful in the coming assignments and the course project.
import torch
import torchvision
import numpy as np
import matplotlib.pyplot as plt
import unittest
%matplotlib inline
%load_ext autoreload
%autoreload 2
plt.rcParams.update({'font.size': 12})
torch.random.manual_seed(42)
test = unittest.TestCase()
The Dataset class is an abstraction over a sequence of python objects,
each representing a sample (with or without a label). Its main purpose is
to load a single (possibly labeled) sample from some source (disk, web, etc.) into memory,
and transform it into a usable representation (e.g. image to tensor).
The Dataset abstracts away exactly when the data is loaded into memory: It can be on
demand when each sample is accessed, all in advance or some combination using e.g. caching.
This is implementation-specific.
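To make this concrete, here's a minimal sketch of a map-style Dataset (the class name and data are purely illustrative) that generates each sample lazily, on access:

```python
import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Toy map-style dataset: each sample is generated lazily, on access."""
    def __init__(self, num_samples):
        self.num_samples = num_samples

    def __getitem__(self, index):
        if not 0 <= index < self.num_samples:
            raise IndexError(index)
        # The sample is only created when it's actually requested
        x = torch.tensor([float(index)])
        label = index ** 2  # a deterministic "label", for demonstration only
        return x, label

    def __len__(self):
        return self.num_samples

ds = SquaresDataset(10)
x, y = ds[3]
print(len(ds), y)  # 10 9
```

An implementation could just as well pre-load everything in `__init__` or cache on first access; the interface stays the same.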
As a warm-up, let's create a demonstration Dataset that returns noise images. It should:

- return images of shape (C, W, H) containing random contents;
- return labels between 0 and num_classes-1;
- use pixel values in the range [0, 255];
- contain num_samples labeled images.

First, let's implement a simple function to generate a labelled random image.
TODO Implement the random_labelled_image function in the hw1/datasets.py module.
Use the code below to test your implementation.
import hw1.datasets as hw1datasets
import cs236781.plot as plot
image_shape = (3, 32, 64)
num_classes = 3
low, high = 10, 100
# Generate some random images and check values
X_ = None
for i in range(100):
    X, y = hw1datasets.random_labelled_image(image_shape, num_classes, low, high, dtype=torch.uint8)
    test.assertEqual(X.shape, image_shape)
    test.assertIsInstance(y, int)
    test.assertTrue(0 <= y < num_classes)
    test.assertTrue(torch.all((X >= low) & (X < high)))
    if X_ is not None:
        test.assertFalse(torch.all(X == X_))
    X_ = X
print(y)
plot.tensors_as_images([X, X_]);
2
In many cases we'll need to consistently get repeatable results even though we're using pseudo-random number generators (PRNGs). The way to do this is to provide a seed to the generator. Given the same seed, a PRNG will always generate the same sequence of numbers.
Here, we need a way to generate the same random image when accessing our dataset at the same index (e.g. to simulate a real set of images).
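One possible approach (shown as a sketch; the required torch_temporary_seed may be structured differently, and the name here is illustrative) is a context manager that saves the global RNG state, applies the temporary seed, and restores the saved state on exit:

```python
import contextlib
import torch

@contextlib.contextmanager
def temp_seed_sketch(seed):
    # Save the global RNG state, apply the temporary seed,
    # and restore the saved state on exit (even if an error occurs)
    saved_state = torch.random.get_rng_state()
    try:
        torch.random.manual_seed(seed)
        yield
    finally:
        torch.random.set_rng_state(saved_state)

torch.random.manual_seed(0)
a = torch.randn(3)
with temp_seed_sketch(42):
    b = torch.randn(3)  # drawn from the seed-42 stream
c = torch.randn(3)      # continues the outer (seed-0) stream

# Reference: the same outer stream, without the context in the middle
torch.random.manual_seed(0)
ref = torch.cat([torch.randn(3), torch.randn(3)])
```

Here `torch.allclose(torch.cat([a, c]), ref)` holds: the numbers drawn inside the context did not consume anything from the outer stream.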
TODO Implement the torch_temporary_seed function in the hw1/datasets.py module.
Use the code below to test your implementation.
seeds = [42, 24]
torch.random.manual_seed(seeds[0])
# Before the context, the first seed affects the output
data_pre_context = torch.randn(100,)
with hw1datasets.torch_temporary_seed(seeds[1]):
    # Within this context, the second seed is in effect
    data_in_context = torch.randn(100,)
# After the context, the random state should be restored
data_post_context = torch.randn(100,)
data_around_context = torch.cat([data_pre_context, data_post_context])
# Use first seed, generate data in the same way but without changing context in the middle
torch.random.manual_seed(seeds[0])
data_no_context = torch.cat([torch.randn(100,), torch.randn(100,)])
# Identical results show that the context didn't affect external random state
test.assertTrue(torch.allclose(data_no_context, data_around_context))
# The data generated in the context should match what we would generate with the second seed
torch.random.manual_seed(seeds[1])
test.assertTrue(torch.allclose(data_in_context, torch.randn(100,)))
Now we can implement the dataset as required.
TODO Implement the RandomImageDataset class in the hw1/datasets.py module.
Use the code below to test your implementation.
# Test RandomImageDataset
# Create the dataset
num_samples = 500
num_classes = 10
image_size = (3, 32, 32)
ds = hw1datasets.RandomImageDataset(num_samples, num_classes, *image_size)
# You can load individual items from the dataset by indexing
img0, cls0 = ds[139]
# Plot first N images from the dataset with a helper function
fig, axes = plot.dataset_first_n(ds, 9, show_classes=True, nrows=3)
# The same image should be returned every time the same index is accessed
for i in range(num_samples):
    X, y = ds[i]
    X_, y_ = ds[i]
    test.assertEqual(X.shape, image_size)
    test.assertIsInstance(y, int)
    test.assertEqual(y, y_)
    test.assertTrue(torch.all(X == X_))

# Should raise if out of range
for i in range(num_samples, num_samples+10):
    with test.assertRaises(ValueError):
        ds[i]
This simple dataset is a useful abstraction when we know in advance the number of samples in our dataset and can access them by indexing. However, in many cases we simply cannot know about all data in advance. For example, perhaps new data is generated in real time.
To deal with these cases, we can use a different type of abstraction: an IterableDataset which provides an interface only to iterate over samples, but not to index them directly.
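A minimal sketch of such an iterable-only dataset (the class name is illustrative, not the required ImageStreamDataset):

```python
import itertools
import torch
from torch.utils.data import IterableDataset

class NoiseStreamSketch(IterableDataset):
    """An endless stream of random vectors; supports iteration only."""
    def __init__(self, dim):
        self.dim = dim

    def __iter__(self):
        # An infinite generator: the dataset has no length and no indexing
        while True:
            yield torch.randn(self.dim)

stream = NoiseStreamSketch(4)
first_three = list(itertools.islice(stream, 3))
```

Since `__len__` is never defined, calling `len(stream)` raises `TypeError`, exactly as the test block below expects from the stream dataset.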
Let's implement such a dataset which will allow us to iterate over an infinite stream of randomly-generated images.
ds = hw1datasets.ImageStreamDataset(num_classes, *image_size)
# This dataset can't be indexed
with test.assertRaises(NotImplementedError):
    ds[0]

# There is no length
with test.assertRaises(TypeError):
    len(ds)
# Arbitrarily stop somewhere
stop = torch.randint(2**11, 2**16, (1,)).item()
# We can iterate over it, indefinitely
for i, (X, y) in enumerate(ds):
    test.assertEqual(X.shape, image_size)
    test.assertIsInstance(y, int)
    if i > stop:
        break
print(f'Generated {i} images')
test.assertGreater(i, stop)
Generated 22112 images
Now that we've created a simple Dataset to see how these classes work, we'll load one of PyTorch's built-in datasets: CIFAR-10. This is a famous dataset consisting of 60,000 small 32x32 color images classified into 10 classes.
The torchvision package has built-in Dataset classes that can download the data to a local folder,
load it, transform it using arbitrary transform functions and iterate over the resulting samples.
Run the following code block to download CIFAR-10, create a Dataset from it, and plot some random images. The data won't be downloaded again if it's already present.
import os
import torchvision
import torchvision.transforms as tvtf
cfar10_labels = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
data_root = os.path.expanduser('~/.pytorch-datasets')
cifar10_train_ds = torchvision.datasets.CIFAR10(
    root=data_root, download=True, train=True,
    transform=tvtf.ToTensor()
)
print('Number of samples:', len(cifar10_train_ds))
# Plot them with a helper function
fig, axes = plot.dataset_first_n(cifar10_train_ds, 64,
                                 show_classes=True, class_labels=cfar10_labels,
                                 nrows=8, hspace=0.5)
Files already downloaded and verified
Number of samples: 50000
Now that we've loaded the entire CIFAR-10 dataset, we would like to work with a smaller subset
of it to reduce the runtime of the code in this notebook.
A simple way to achieve this with Datasets is to wrap a Dataset in another Dataset that does this for us. This will make it easy to use our subset with DataLoaders as you will see later.
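The wrapping idea can be sketched as follows (illustrative names; not necessarily the required SubsetDataset implementation):

```python
from torch.utils.data import Dataset

class SubsetSketch(Dataset):
    """Wraps a source dataset, exposing `subset_len` items starting at `offset`."""
    def __init__(self, source, subset_len, offset=0):
        assert offset + subset_len <= len(source)
        self.source, self.subset_len, self.offset = source, subset_len, offset

    def __getitem__(self, index):
        if not 0 <= index < self.subset_len:
            raise IndexError(index)
        # Delegate to the wrapped dataset with a shifted index
        return self.source[self.offset + index]

    def __len__(self):
        return self.subset_len

class RangeDataset(Dataset):
    """Trivial source dataset over 0..n-1, for demonstration only."""
    def __init__(self, n): self.n = n
    def __getitem__(self, i): return i
    def __len__(self): return self.n

sub = SubsetSketch(RangeDataset(100), subset_len=5, offset=10)
print(len(sub), sub[0], sub[4])  # 5 10 14
```

Because the wrapper is itself a Dataset, anything that consumes a Dataset (including a DataLoader) can consume the subset unchanged.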
TODO Complete the implementation of SubsetDataset in hw1/datasets.py and use the following code block to test.
subset_len = 5000
subset_offset = 1234
cifar10_train_subset_ds = hw1datasets.SubsetDataset(cifar10_train_ds, subset_len, subset_offset)
dataset_x, dataset_y = cifar10_train_ds[subset_offset + 10]
subset_x, subset_y = cifar10_train_subset_ds[10]
# Tests
test.assertEqual(len(cifar10_train_subset_ds), subset_len)
test.assertTrue(torch.all(dataset_x == subset_x))
test.assertEqual(dataset_y, subset_y)
with test.assertRaises(IndexError, msg="Out of bounds index should raise IndexError"):
    tmp = cifar10_train_subset_ds[subset_len]
Notice that when we initialized the Dataset instance for CIFAR-10, we provided a transform parameter.
This is a way to specify an arbitrary transformation that should be run on each sample prior to returning it from the dataset.
In the above, we used the ToTensor() transformation from torchvision.transforms to convert the
images from a PIL (Python Imaging Library) image object, which has a shape of 32x32x3 and values in range [0, 255], into a PyTorch Tensor of shape 3x32x32 with values in range [0, 1].
To demonstrate the use of transforms, we'll implement two custom transforms which invert the colors and flip the images around the horizontal axis.
TODO Complete the InvertColors and FlipUpDown classes in the hw1/transforms.py module.
import hw1.transforms as hw1transforms
cifar10_inverted_ds = torchvision.datasets.CIFAR10(
    root=data_root, download=True, train=True,
    transform=tvtf.Compose([  # Compose allows us to chain multiple transforms in a sequence
        tvtf.ToTensor(),  # Convert PIL image to pytorch Tensor (C,H,W) in range [0,1]
        hw1transforms.InvertColors(),
        hw1transforms.FlipUpDown(),
    ])
)
fig, axes = plot.dataset_first_n(cifar10_inverted_ds, 64,
                                 show_classes=True, class_labels=cfar10_labels,
                                 nrows=8, hspace=0.5)
test.assertTrue(torch.allclose(cifar10_train_ds[0][0], torch.flip(1. - cifar10_inverted_ds[0][0], [1])),
                "Wrong custom transform")
Files already downloaded and verified
## DataLoaders and Samplers

We have seen that a Dataset is simply an iterable allowing us to iterate over samples, and possibly also to access them by index.
Simple to implement, but not very powerful.
The real benefit comes when combining a Dataset with a DataLoader.
A DataLoader samples a batch of samples from the dataset according to logic defined by a Sampler object.
The sampler decides how to partition the dataset into batches of N samples.
The DataLoader additionally handles loading samples in parallel to speed up creation of a batch.
A major motivation here is memory usage. When combining a DataLoader with a Dataset we can easily
control memory constraints by simply setting the batch size.
This is important since large datasets (e.g. ImageNet) do not fit in memory of most machines.
Since a Dataset can lazily load samples from disk on access,
and the DataLoader can sample random samples from it in parallel, we are provided with a simple
yet high-performance mechanism to iterate over random batches from our dataset without needing to
hold all of it in memory.
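A small self-contained example of this batching behavior, using an in-memory TensorDataset with made-up data (so it runs without any download):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# 1000 fake samples of dimension 16, with random integer labels
ds = TensorDataset(torch.randn(1000, 16), torch.randint(0, 10, (1000,)))

# Only batch_size samples are materialized as a batch at any one time
dl = DataLoader(ds, batch_size=64, shuffle=True)

batch_sizes = [x.shape[0] for x, y in dl]
print(len(batch_sizes), batch_sizes[-1])  # 16 40
```

With 1000 samples and a batch size of 64 we get 15 full batches plus a final partial batch of 40 (by default, drop_last=False keeps the partial batch).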
Let's create a basic DataLoader for our CIFAR-10 dataset.
Run the following code block multiple times and observe that different samples are shown each time in the first few batches.
# Create a simple DataLoader that partitions the data into batches
# of size N=8 in random order, using two background processes
cifar10_train_dl = torch.utils.data.DataLoader(
    cifar10_train_ds, batch_size=8, shuffle=True, num_workers=2
)

# Iterate over batches sampled with our DataLoader
num_batches_to_show = 5
for idx, (images, classes) in enumerate(cifar10_train_dl):
    # The DataLoader returns a tuple of:
    # images: Tensor of size NxCxWxH
    # classes: Tensor of size N
    fig, axes = plot.tensors_as_images(images, figsize=(8, 1))
    fig.suptitle(f'Batch #{idx+1}:', x=0, y=0.6)
    if idx >= num_batches_to_show - 1:
        break
Here, we specified shuffle=True to the DataLoader. This automatically creates a Sampler which simply returns indices from the Dataset in random order.
To better control the content of the batches, we can create our own custom sampler.
Imagine we want each batch to contain one sample from the beginning of the dataset and
another from the end. If we have N samples, we would like to get the following sequence of indices: [0, N-1, 1, N-2, 2, N-3, ...] and then use a batch_size of 2.
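The index sequence itself can be sketched as a plain Python generator (the helper name is hypothetical; the required sampler must wrap this kind of logic in a Sampler class):

```python
def first_last_indices(n):
    """Yield indices interleaved from both ends: 0, n-1, 1, n-2, ..."""
    lo, hi = 0, n - 1
    while lo < hi:
        yield lo
        yield hi
        lo, hi = lo + 1, hi - 1
    if lo == hi:
        # Odd n: the middle element is emitted once, last
        yield lo

print(list(first_last_indices(5)))  # [0, 4, 1, 3, 2]
print(list(first_last_indices(6)))  # [0, 5, 1, 4, 2, 3]
```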
TODO Implement the FirstLastSampler class in the hw1/dataloaders.py module.
import hw1.dataloaders as hw1dataloaders
# Test sampler with odd number of elements
sampler = hw1dataloaders.FirstLastSampler(list(range(5)))
test.assertEqual(list(sampler), [0,4, 1,3, 2,])
# Test sampler with even number of elements
sampler = hw1dataloaders.FirstLastSampler(list(range(6)))
test.assertEqual(list(sampler), [0,5, 1,4, 2,3])
# Create a DataLoader that partitions the data into batches
# of size N=2 in an order determined by our custom sampler
cifar10_train_dl = torch.utils.data.DataLoader(
    cifar10_train_ds, batch_size=2, num_workers=0,
    sampler=hw1dataloaders.FirstLastSampler(cifar10_train_ds),
)

# Iterate over batches sampled with our DataLoader
num_batches_to_show = 3
for idx, (images, classes) in enumerate(cifar10_train_dl):
    fig, axes = plot.tensors_as_images(images, figsize=(8, 1))
    fig.suptitle(f'Batch #{idx+1}:', x=0, y=0.6)
    if idx >= num_batches_to_show - 1:
        break
Now that we know about DataLoaders we can use them to do something useful: split a training dataset into Training and Validation sets.
A common issue in machine-learning models is the abundance of hyperparameters that must be selected prior to training the model on data. These hyperparameters may be part of the model itself or part of the training process. We would like to determine which hyperparameter selection best fits the training data and, more importantly, generalizes to unseen data.
A prevalent approach is therefore to split the training dataset into two parts: One for actual training, i.e. tuning model parameters e.g. weights in the case of neural nets, and another for validation, i.e. comparing one model or set of hyperparameters to another. After the best model is selected (by seeking the minimal validation error), it can be retrained with the entire training set.
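One common way to realize such a split (a sketch with hypothetical names, not necessarily the required create_train_validation_loaders implementation) is to shuffle the indices once and hand disjoint parts to two SubsetRandomSamplers:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.sampler import SubsetRandomSampler

def split_loaders_sketch(dataset, validation_ratio, batch_size=32):
    n = len(dataset)
    n_valid = int(validation_ratio * n)
    # Shuffle all indices once, then hand disjoint parts to two samplers
    indices = torch.randperm(n).tolist()
    dl_valid = DataLoader(dataset, batch_size, sampler=SubsetRandomSampler(indices[:n_valid]))
    dl_train = DataLoader(dataset, batch_size, sampler=SubsetRandomSampler(indices[n_valid:]))
    return dl_train, dl_valid

# Demo on a small fake dataset
ds_demo = TensorDataset(torch.randn(100, 4), torch.zeros(100))
dl_t, dl_v = split_loaders_sketch(ds_demo, validation_ratio=0.2)
```

Since both loaders share the underlying dataset and differ only in their samplers, no data is copied; the split is purely index-based.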

TODO Implement the function create_train_validation_loaders in the hw1/dataloaders.py module.
Use the following code block to check your implementation.
# Testing the train/validation split dataloaders
import hw1.dataloaders as hw1dataloaders
validation_ratio = 0.2
dl_train, dl_valid = hw1dataloaders.create_train_validation_loaders(cifar10_train_ds, validation_ratio)
train_idx = set(dl_train.sampler.indices)
valid_idx = set(dl_valid.sampler.indices)
train_size = len(train_idx)
valid_size = len(valid_idx)
print('Training set size: ', train_size)
print('Validation set size: ', valid_size)
# Tests
test.assertEqual(train_size+valid_size, len(cifar10_train_ds), "Incorrect total number of samples")
test.assertEqual(valid_size, validation_ratio * (train_size + valid_size), "Incorrect ratio")
test.assertTrue(train_idx.isdisjoint(valid_idx), "Train and validation sets are not disjoint")
Training set size: 40000
Validation set size: 10000
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw1/answers.py.
from cs236781.answers import display_answer
import hw1.answers
Determine whether each of the following statements is true or false, and explain why in detail:
display_answer(hw1.answers.part1_q1)
Your answer:
3. True. During cross-validation we should not use the test set, in order to prevent leakage; the test set should be used only for final model evaluation.
4. False. The validation error at each fold only guides model selection inside the cross-validation procedure (each fold converts part of the train set into a validation set, so using a single fold's validation error as a proxy for the generalization error is biased).
Your friend has trained a simple linear regression model, e.g. $\hat{y}=\vectr{w}\vec{x}+b$, with some training data. He then evaluated it on a disjoint test set and concluded that the model has over-fit the training set, and therefore decided to add a regularization term $\lambda \norm{\vec{w}}^2$ to the loss, where $\lambda$ is a hyperparameter. In order to select the value of $\lambda$, your friend re-trained the model on his training set with different values of $\lambda$ and then chose the value which produced the best results on the test set.
Is your friend's approach justified? Explain why or why not.
display_answer(hw1.answers.part1_q2)
Your answer: No. Choosing values for model hyperparameters should be done using the training set and a cross-validation procedure. Using the test set for this causes leakage of the test set into the training procedure; any subsequent evaluation of the model on the test set would then be biased and would not reflect the true generalization error.
In this part, we'll familiarize ourselves with the PyTorch tensor API by implementing a very simple classifier,
kNN, using efficient, vectorized tensor operations alone.
We'll then implement cross-validation, an important ML technique used to find suitable
values for a model's hyperparameters.
import os
import torch
import torchvision
import numpy as np
import matplotlib.pyplot as plt
import unittest
%matplotlib inline
%load_ext autoreload
%autoreload 2
plt.rcParams.update({'font.size': 12})
torch.random.manual_seed(1904)
test = unittest.TestCase()
Arguably the most basic classification scheme in a supervised learning setting is the
k nearest-neighbor (kNN) classifier.
Given a training data set, kNN's "training" phase consists of simply memorizing it.
When a classification of an unseen sample is required, some distance metric (e.g. Euclidean)
is computed between it and every training sample.
The unseen sample is then classified according to the majority label of its k nearest neighbors.
Here we'll implement the most basic kNN, working directly on image pixel values and computing L2 distance between a test image and every known training image. We'll use data from the MNIST database of handwritten digits. This database contains single-channel images with a constant black background and the digits are roughly the same size, which makes it feasible to obtain bearable classification accuracy even with such a naïve model.
Note, however, that real-world kNN models are often implemented with tree-based data structures that find nearest neighbors in logarithmic time, with specialized distance functions, and using image features instead of raw pixels.
TODO Implement the TensorView transform in the hw1/transforms module, and run the following code to
load the data we'll work with.
# Prepare data for kNN Classifier
import torchvision.transforms as tvtf
import cs236781.dataloader_utils as dataloader_utils
import hw1.datasets as hw1datasets
import hw1.transforms as hw1tf
# Define the transforms that should be applied to each MNIST image before returning it
tf_ds = tvtf.Compose([
tvtf.ToTensor(), # Convert PIL image to pytorch Tensor
hw1tf.TensorView(-1), # Reshape to 1D Tensor
])
# Define how much data to load (only use a subset for speed)
num_train = 10000
num_test = 1000
batch_size = 1024
# Training dataset & loader
data_root = os.path.expanduser('~/.pytorch-datasets')
ds_train = hw1datasets.SubsetDataset(
torchvision.datasets.MNIST(root=data_root, download=True, train=True, transform=tf_ds), num_train)
dl_train = torch.utils.data.DataLoader(ds_train, batch_size)
# Test dataset & loader
ds_test = hw1datasets.SubsetDataset(
torchvision.datasets.MNIST(root=data_root, download=True, train=False, transform=tf_ds), num_test)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size)
# Get all test data
x_test, y_test = dataloader_utils.flatten(dl_test)
TODO Implement the l2_dist function in the hw1/knn_classifier.py module. This is the core of the kNN algorithm. You'll need to use broadcasting to implement it in an efficient, vectorized way (without loops).
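One common vectorized approach (shown as a sketch under the assumption that the expansion trick is acceptable; your implementation may differ) uses the identity ||a-b||^2 = ||a||^2 - 2 a·b + ||b||^2, which broadcasting turns into a single matrix product:

```python
import torch

def l2_dist_sketch(x1, x2):
    """Pairwise L2 distances: (N1, D) x (N2, D) -> (N1, N2), no loops."""
    sq1 = x1.pow(2).sum(dim=1, keepdim=True)   # (N1, 1) column of squared norms
    sq2 = x2.pow(2).sum(dim=1)                 # (N2,)  broadcasts as a row
    sq_dists = sq1 - 2.0 * (x1 @ x2.t()) + sq2
    # Clamp tiny negatives caused by floating-point cancellation
    return sq_dists.clamp(min=0.0).sqrt()

x1, x2 = torch.randn(12, 34), torch.randn(45, 34)
dists = l2_dist_sketch(x1, x2)
print(dists.shape)  # torch.Size([12, 45])
```

The clamp matters in practice: for nearly identical vectors the three terms cancel, and round-off can leave a slightly negative value whose square root would be NaN.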
x,y = ds_train[0]
x.shape
torch.Size([784])
import itertools as it
import hw1.knn_classifier as hw1knn
def l2_dist_naive(x1, x2):
    """
    Naive distance calculation, just for testing.
    Super slow, don't use!
    """
    dists = torch.empty(x1.shape[0], x2.shape[0], dtype=torch.float)
    for i, j in it.product(range(x1.shape[0]), range(x2.shape[0])):
        dists[i, j] = torch.sum((x1[i] - x2[j])**2).item()
    return torch.sqrt(dists)
# Test distance calculation
x1 = torch.randn(12, 34)
x2 = torch.randn(45, 34)
dists = hw1knn.l2_dist(x1, x2)
print(dists.shape)
dists_naive = l2_dist_naive(x1, x2)
test.assertTrue(torch.allclose(dists, dists_naive), msg="Wrong distances")
torch.Size([12, 45])
TODO Implement the accuracy function in the hw1/knn_classifier.py module.
This will be our score. It will simply return the fraction of predictions that are correct.
y1 = torch.tensor([0, 1, 2, 3])
y2 = torch.tensor([0, 1, 2, 2])
test.assertEqual(hw1knn.accuracy(y1, y2), 0.75)
TODO Complete the implementation of the KNNClassifier class in the module hw1/knn_classifier.py:

- the train() method
- the predict() method

Use the following code to test your implementations.
# Test kNN Classifier
knn_classifier = hw1knn.KNNClassifier(k=10)
knn_classifier.train(dl_train)
y_pred = knn_classifier.predict(x_test)
# Calculate accuracy
accuracy = hw1knn.accuracy(y_test, y_pred)
print(f'Accuracy: {accuracy*100:.2f}%')
# Sanity check: at least 80% accuracy
test.assertGreater(accuracy, 0.8)
Accuracy: 91.50%
A common way to choose hyperparameters for a model or even the model itself is by applying
K-fold cross-validation (CV).
For each candidate set of hyperparameters, the model is trained K times, each time with a different split of the training data into train and validation sets (called a fold). The set of hyperparameters which resulted in the lowest average validation error rate is selected.
More specifically, K-fold CV is usually performed as follows:
1. Split the training set into K non-overlapping parts.
2. For each k=0,...,K-1: use the k-th part as the validation set and the remaining K-1 parts as the training set, then train and evaluate the model.

Now we would like to find the best value of k for applying our kNN model to MNIST.
In this case we already fixed the model and there is only one hyperparameter, the value of k
(not to be confused with K, the number of folds for the cross validation).
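The fold bookkeeping itself can be sketched with plain index arithmetic (function name hypothetical, not the required find_best_k):

```python
import numpy as np

def kfold_indices_sketch(n, num_folds):
    """For each fold k, yield (train_idx, valid_idx) over indices 0..n-1."""
    folds = np.array_split(np.arange(n), num_folds)
    for k in range(num_folds):
        valid_idx = folds[k]
        # Everything except fold k serves as the training set
        train_idx = np.concatenate([folds[j] for j in range(num_folds) if j != k])
        yield train_idx, valid_idx

splits = list(kfold_indices_sketch(10, 4))
for train_idx, valid_idx in splits:
    print(valid_idx.tolist())
```

Each index appears in exactly one validation fold, so averaging the per-fold validation accuracies uses every training sample for validation exactly once.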
TODO Complete the implementation of the find_best_k function in the knn_classifier.py module.
num_folds = 4
k_choices = [1, 3, 5, 8, 12, 20, 50]
# Run cross-validation
best_k, accuracies = hw1knn.find_best_k(ds_train, k_choices, num_folds)
# Plot accuracies per k
_, ax = plt.subplots(figsize=(12,6), subplot_kw=dict(xticks=k_choices))
for i, k in enumerate(k_choices):
    curr_accuracies = accuracies[i]
    ax.scatter([k] * len(curr_accuracies), curr_accuracies)
accuracies_mean = np.array([np.mean(accs) for accs in accuracies])
accuracies_std = np.array([np.std(accs) for accs in accuracies])
ax.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
ax.set_title(f'{num_folds}-fold Cross-validation on k')
ax.set_xlabel('k')
ax.set_ylabel('Accuracy')
print('best_k =', best_k)
best_k = 1
Now that we found our best_k, we can train the model with that value of k on the full training set and evaluate the accuracy on the test set:
knn_classifier = hw1knn.KNNClassifier(k=best_k)
knn_classifier.train(dl_train)
y_pred = knn_classifier.predict(x_test)
# Calculate accuracy
accuracy_best_k = hw1knn.accuracy(y_test, y_pred)
print(f'Accuracy: {accuracy_best_k*100:.2f}%')
test.assertGreater(accuracy_best_k, accuracy)
Accuracy: 92.00%
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw1/answers.py.
from cs236781.answers import display_answer
import hw1.answers
Does increasing k lead to improved generalization for unseen data? Why or why not? Up to what point? Think about the extremal values of k.
display_answer(hw1.answers.part2_q1)
Your answer: In kNN, k is related to the bias-variance trade-off. When k=1 the model fits the training data exactly but will likely fail to generalize to new data; in terms of bias-variance this corresponds to low bias and high variance. The other extreme is k=N: the model is underfitted, will likely fail to learn the structure of the training set, and therefore will also fail to generalize (under the assumption that new samples come from the same distribution); this corresponds to high bias and low variance. The "sweet spot" is the minimum of the total error $err_{tot} = bias^2 + variance + noise$. We conclude that increasing k reduces the generalization error up to the point where the total error starts to increase.
Explain why (i.e. in what sense) using k-fold CV, as detailed above, is better than:
display_answer(hw1.answers.part2_q2)
Your answer:
$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial {#1}}{\partial {#2}}} $
In this part we'll learn about loss functions and how to optimize them with gradient descent. We'll then use this knowledge to train a very simple model: a linear SVM.
import os
import torch
import torchvision
import numpy as np
import matplotlib.pyplot as plt
import unittest
import cs236781.dataloader_utils as dl_utils
%matplotlib inline
%load_ext autoreload
%autoreload 2
plt.rcParams.update({'font.size': 12})
torch.random.manual_seed(1904)
test = unittest.TestCase()
In multi-class linear classification we have $C$ classes which we assume our samples may belong to. We apply a linear function to a sample $\vec{x} \in \set{R}^{D}$ and obtain a score $s_j$ which represents how well $x$ fits the class $1\leq j\leq C$ according to our model: $$ s_j = \vectr{w_j} \vec{x} + b_j. $$
Note that we have a different set of model parameters (weights) $\vec{w_j},~b_j$ for each class, so a total of $C\cdot(D+1)$ parameters.
To classify a sample, we simply calculate the score for each class and choose the class with the highest score as our prediction.
One interpretation of the weights $\vec{w_j},~b_j$ is that they represent the parameters of a hyperplane in the $D$-dimensional feature space. Under this interpretation, the class score $s_j$ of a sample is proportional to the signed distance of that sample from the hyperplane representing the $j$-th class. Note that this score can be positive or negative (depending on which side of the hyperplane the sample lies). Such a classifier therefore splits the sample space into regions: the farther a sample is on the positive side of the hyperplane for class $j$, the higher $s_j$, and the more likely the sample belongs to class $j$.
In the context of supervised learning of a linear classifier model, we map a dataset (or a batch from a dataset) of $N$ samples (for example, images flattened to vectors of length $D$) to a score for each of the $C$ possible classes using the linear function above.
To make the implementation efficient, we'll represent the mapping with a single matrix multiplication, employing the "Bias trick": Instead of both $\vec{w_j}$ and $b_j$ per class, we'll put the bias term at the beginning of the weight vector and add a term $1$ at the start of each sample.
The class scores for each sample are then given by:
$$ \mat{S} = \mat{X} \mat{W} $$

Where here (and in the code examples you'll work with),
Notes:
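The equivalence underlying the bias trick is easy to verify numerically; here is a small sketch with made-up dimensions (all names illustrative):

```python
import torch

torch.manual_seed(0)
N, D, C = 5, 4, 3
X = torch.randn(N, D)
W_cols = torch.randn(D, C)  # one weight column per class
b = torch.randn(C)          # one bias per class

# Scores with an explicit per-class bias term
S_explicit = X @ W_cols + b

# Bias trick: prepend a constant 1 to every sample and
# prepend the biases as the first row of the weight matrix
X_aug = torch.cat([torch.ones(N, 1), X], dim=1)      # (N, D+1)
W_aug = torch.cat([b.unsqueeze(0), W_cols], dim=0)   # (D+1, C)
S_trick = X_aug @ W_aug
```

`torch.allclose(S_explicit, S_trick)` holds: the constant-1 feature multiplied by the first weight row reproduces the bias, so the whole mapping becomes a single matrix multiplication.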
TODO Implement the BiasTrick transform class in the module hw1/transforms.py.
import hw1.transforms as hw1tf
tf_btrick = hw1tf.BiasTrick()
test_cases = [
    torch.randn(64, 512),
    torch.randn(2, 3, 4, 5, 6, 7),
    torch.randint(low=0, high=10, size=(1, 12)),
    torch.tensor([10, 11, 12]),
]
for x_test in test_cases:
    xb = tf_btrick(x_test)
    print('shape =', xb.shape)
    test.assertEqual(x_test.dtype, xb.dtype, "Wrong dtype")
    test.assertTrue(torch.all(xb[..., 1:] == x_test), "Original features destroyed")
    test.assertTrue(torch.all(xb[..., [0]] == torch.ones(*xb.shape[:-1], 1)), "First feature is not equal to 1")
shape = torch.Size([64, 513])
shape = torch.Size([2, 3, 4, 5, 6, 8])
shape = torch.Size([1, 13])
shape = torch.Size([4])
import torchvision.transforms as tvtf
# Define the transforms that should be applied to each image in the dataset before returning it
tf_ds = tvtf.Compose([
    # Convert PIL image to pytorch Tensor
    tvtf.ToTensor(),
    # Normalize each channel with precomputed mean and std of the train set
    tvtf.Normalize(mean=(0.1307,), std=(0.3081,)),
    # Reshape to 1D Tensor
    hw1tf.TensorView(-1),
    # Apply the bias trick (add bias element to features)
    hw1tf.BiasTrick(),
])
The following code will use your transform to load a subset of the MNIST dataset for us to work with.
import hw1.datasets as hw1datasets
import hw1.dataloaders as hw1dataloaders
# Define how much data to load
num_train = 10000
num_test = 1000
batch_size = 1000
# Training dataset
data_root = os.path.expanduser('~/.pytorch-datasets')
ds_train = hw1datasets.SubsetDataset(
    torchvision.datasets.MNIST(root=data_root, download=True, train=True, transform=tf_ds),
    num_train)

# Create training & validation sets
dl_train, dl_valid = hw1dataloaders.create_train_validation_loaders(
    ds_train, validation_ratio=0.2, batch_size=batch_size
)

# Test dataset & loader
ds_test = hw1datasets.SubsetDataset(
    torchvision.datasets.MNIST(root=data_root, download=True, train=False, transform=tf_ds),
    num_test)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size)
x0, y0 = ds_train[0]
n_features = torch.numel(x0)
print(n_features)
n_classes = 10
# Make sure samples have bias term added
test.assertEqual(n_features, 28*28*1+1, "Incorrect sample dimension")
785
TODO Complete the implementation of the __init__(), predict() and evaluate_accuracy() functions in the
LinearClassifier class located in the hw1/linear_classifier.py module.
import hw1.linear_classifier as hw1linear
# Create a classifier
lin_cls = hw1linear.LinearClassifier(n_features, n_classes)
# Evaluate accuracy on test set
mean_acc = 0
for (x, y) in dl_test:
    y_pred, _ = lin_cls.predict(x)
    mean_acc += lin_cls.evaluate_accuracy(y, y_pred)
mean_acc /= len(dl_test)

print(f"Accuracy: {mean_acc:.1f}%")
Accuracy: 6.4%
You should get an accuracy of around 10%, corresponding to a random guess of one of ten classes. You can run the above code block multiple times to sample different initial weights and get slightly different results.
We have seen that a linear model computes the class scores for each sample using a linear mapping as a score function. However, in order to train the model, we need to define some measure of how well we've classified our samples compared to their ground-truth labels. This measure is known as a loss function, and its selection is crucial in determining the model that will result from training. A loss function produces lower values the better the classification is.
A very common linear model for classification is the Support Vector Machine. An SVM attempts to find separating hyperplanes that have the property of creating a maximal margin to the training samples, i.e. hyperplanes that are as far as possible from the closest training samples. For example, in the following image we see a simple case with two classes of samples that have only two features. The data is linearly separable and it's easy to see there are infinite possible hyperplanes (in this case lines) that separate the data perfectly.
The SVM model finds the optimal hyperplane, which is the one with the maximal margin. The data points closest to the separating hyperplane are called the Support Vectors (it can be shown that only they determine the hyperplane). We can see that the width of the margin is $\frac{2}{\norm{\vec{w}}}$. In this simple case since the data is linearly separable, there exists a solution where no samples fall within the margin. If the data is not linearly separable, we need to allow samples to enter the margin (with a cost). This is known as a soft-margin SVM.
There are many ways to train an SVM model. Classically, the problem is stated as a constrained optimization problem and solved with quadratic-programming techniques. In this exercise, we'll instead work directly with the unconstrained SVM loss function, calculate its gradient analytically, and then minimize it with gradient descent. As we'll see in the rest of the course, this technique is a major component of training deep neural networks.
The in-sample (empirical) loss function for a multiclass soft-margin SVM can be stated as follows:
$$ L(\mat{W}) = \frac{1}{N} \sum_{i=1}^{N} L_{i}(\mat{W}) + \frac{\lambda}{2} \norm{\mat{W}}^2 $$Where the first term is the mean pointwise data-dependent loss $L_{i}$, given by the hinge loss formula,
$$ L_{i}(\mat{W}) = \sum_{j \neq y_i} \max\left(0, \Delta+ \vectr{w_j} \vec{x_i} - \vectr{w_{y_i}} \vec{x_i}\right), $$and the second term is a regularization loss which depends only on model parameters. Note that the hinge loss term sums over the wrong class prediction scores for each sample: $j\neq y_i$, and $y_i$ is the ground-truth label for sample $i$. This can be understood as attempting to make sure that the score for the correct class is higher than the other classes by at least some margin $\Delta > 0$, otherwise a loss is incurred. This way, we allow samples to fall within the margin but incur loss, which gives us a soft-margin SVM.
The regularization term penalizes large weight magnitudes to prevent ambiguous solutions since if e.g. $\mat{W^*}$ is a weight matrix that perfectly separates the data, so is $\alpha\mat{W^*}$ for any scalar $\alpha \geq 1$.
Fitting an SVM model then amounts to finding the weight matrix $\mat{W}$ which minimizes $L(\mat{W})$. Note that we write the loss as a function of $\mat{W}$ to emphasize that we wish to minimize its value with respect to the weights $\mat{W}$, even though it obviously also depends on our specific dataset, $\left\{ \vec{x_i}, y_i \right\}_{i=1}^{N}$.
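Before implementing the loss in the homework module, it may help to see the computation in vectorized form. The following is only an illustrative sketch (the function name and interface are made up and differ from the required SVMHingeLoss API):

```python
import torch

def hinge_loss_sketch(x_scores: torch.Tensor, y: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    """Mean multiclass hinge loss (data term only, no regularization)."""
    n = x_scores.shape[0]
    idx = torch.arange(n)
    correct = x_scores[idx, y].unsqueeze(1)                  # (N, 1) ground-truth class scores
    margins = torch.clamp(x_scores - correct + delta, 0.0)   # (N, C) max(0, m_ij)
    margins[idx, y] = 0.0                                    # exclude j == y_i from the sum
    return margins.sum() / n
```

A fully vectorized version like this avoids looping over samples and classes, which matters for performance on large batches.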
TODO Implement the SVM hinge loss function in the module hw1/losses.py, within the SVMHingeLoss class.
Implement just the loss() function. For now you can ignore the part about saving tensors for the gradient calculation. Run the following to test.
import cs236781.dataloader_utils as dl_utils
from hw1.losses import SVMHingeLoss
torch.random.manual_seed(42)
# Classify all samples in the test set
# because it doesn't depend on randomness of train/valid split
x, y = dl_utils.flatten(dl_test)
# Compute predictions
lin_cls = hw1linear.LinearClassifier(n_features, n_classes)
y_pred, x_scores = lin_cls.predict(x)
# Calculate loss with our hinge-loss implementation
loss_fn = SVMHingeLoss(delta=1.)
loss = loss_fn(x, y, x_scores, y_pred)
torch_loss = torch.nn.MultiMarginLoss()
loss2 = torch_loss(x_scores, y).item()*n_classes
# Compare to pre-computed expected value as a test
expected_loss = 9.0233
print("loss =", loss.item())
print('diff =', abs(loss.item()-expected_loss))
test.assertAlmostEqual(loss.item(), expected_loss, delta=1e-2)
loss = 9.023382186889648
diff = 8.218688964767296e-05
In this section we'll implement a simple gradient descent optimizer for the loss function we've implemented above. As you have seen in the lectures, the basic gradient-based optimization scheme is as follows: starting from some initial weights $\mat{W}^{(0)}$, repeatedly apply the update $$ \mat{W}^{(t+1)} = \mat{W}^{(t)} - \eta\, \nabla_{\mat{W}} L\left(\mat{W}^{(t)}\right), $$ where $\eta$ is the learning rate.
The crucial component here is the gradient calculation. In this exercise we'll analytically derive the gradient of the loss and then implement it in code. In the next parts of the course we'll enjoy the automatic-differentiation features of PyTorch, but for now we'll do it the old-fashioned way.
An important detail to note is that while $L(\mat{W})$ is scalar-valued, it is a function of all the elements of the matrix $\mat{W}$. Therefore its gradient w.r.t. $\mat{W}$ is also a matrix of the same shape as $\mat{W}$:
$$ \nabla_{\mat{W}} L = \begin{bmatrix} \frac{\partial L}{\partial W_{1,1}} & & \cdots & \frac{\partial L}{\partial W_{1,C}} \\ \frac{\partial L}{\partial W_{2,1}} & \ddots & \\ \vdots & & \ddots & \\ \frac{\partial L}{\partial W_{D+1,1}} & \cdots & & \frac{\partial L}{\partial W_{D+1,C}} \\ \end{bmatrix} = \begin{bmatrix} \vert & & \vert \\ \frac{\partial L}{\partial\vec{w_1}} & \cdots & \frac{\partial L}{\partial\vec{w_C}}\\ \vert & & \vert \\ \end{bmatrix} \in \set{R}^{(D+1)\times C}. $$For our gradient descent update-step we'll need to create such a matrix of derivatives and evaluate it at the current value of the weight matrix.
The first thing we need to do is formulate an expression for the gradient of the loss function defined above. Since the expression for the loss depends on the columns of $\mat{W}$, we'll derive an expression for the gradient of $L(\mat{W})$ w.r.t. each $\vec{w_j}$:
$$ \pderiv{L}{\vec{w_j}}(\mat{W}) = \frac{1}{N} \sum_{i=1}^{N} \pderiv{L_{i}}{\vec{w_j}}(\mat{W}) + \lambda \vec{w_j}. $$To compute the gradient of the pointwise loss, let's define the margin-loss of sample $i$ for class $j$ as follows: $m_{i,j} = \Delta + \vectr{w_j}\vec{x_i} - \vectr{w_{y_i}}\vec{x_i}$. We can then write the pointwise loss and its gradient in terms of $m_{i,j}$. We'll separate the case of $j=y_i$ (i.e. the gradient for the correct class):
$$ \begin{align} \pderiv{L_i}{\vec{w_j}} & = \begin{cases} \vec{x_i}, & m_{i,j}>0 \\ 0, & \mathrm{else} \\ \end{cases} ,~j \neq y_i \\ \\ \pderiv{L_i}{\vec{w_{y_i}}} & = -\vec{x_i} \sum_{j\neq y_i} \mathbb{1}\left( m_{i,j} > 0 \right) \end{align} $$where $\mathbb{1}(\cdot)$ is an indicator function that takes the value $1$ if its argument is true and $0$ otherwise.
Note: the hinge-loss function is not strictly speaking differentiable due to the $\max$ operator. However, in practice it's not a major concern. Given that we know what argument the $\max$ "chooses", we can differentiate each one of them separately. This is known as a sub-gradient. In the above, when $m_{i,j} \leq 0$ we know the gradient will simply be zero.
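Based on this derivation, here is an illustrative vectorized sketch of the gradient computation (the name and interface are hypothetical; the homework requires implementing this inside SVMHingeLoss.grad() using the tensors saved in loss()):

```python
import torch

def hinge_grad_sketch(x: torch.Tensor, y: torch.Tensor, x_scores: torch.Tensor,
                      delta: float = 1.0) -> torch.Tensor:
    """Gradient of the mean hinge loss w.r.t. W (data term only, no regularization)."""
    n = x.shape[0]
    idx = torch.arange(n)
    correct = x_scores[idx, y].unsqueeze(1)           # (N, 1)
    m = (x_scores - correct + delta > 0).to(x.dtype)  # indicators 1(m_ij > 0), (N, C)
    m[idx, y] = 0.0                                   # correct-class column handled separately:
    m[idx, y] = -m.sum(dim=1)                         # minus the number of active margins
    return x.t().mm(m) / n                            # (D+1, C), same shape as W
```

The trick is that $\mattr{X}\mat{M}$ sums $\vec{x_i}$ over samples with the right indicator coefficients, so no explicit loop over classes or samples is needed.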
TODO Based on the above, implement the gradient of the loss function in the module hw1/losses.py,
within the SVMHingeLoss class.
Implement the grad() function and complete what's missing in the loss() function.
Make sure you understand the above gradient derivation before attempting to implement it.
Note: you'll be implementing only the first term in the above equation for $\pderiv{L}{\vec{w_j}}(\mat{W})$. We'll add the regularization term later.
# Create a hinge-loss function
loss_fn = SVMHingeLoss(delta=1)
# Compute loss and gradient
loss = loss_fn(x, y, x_scores, y_pred)
grad = loss_fn.grad()
# Sanity check only (not correctness): compare the shape of the gradient
test.assertEqual(grad.shape, lin_cls.weights.shape)
But in the above we only checked the shape, how do we know if we've implemented the gradient correctly?
One approach is to recall the formal definition of the derivative, i.e. $$ f'(x)=\lim_{h\to 0} \frac{f(x+h)-f(x)}{h}. $$ Another way to put this is that for a small enough $h$, $$ f(x+h)\approx f(x)+f'(x)\cdot h. $$
We can use this approach to implement a gradient check by applying very small perturbations of each weight (separately) and using the above formula to check the correctness of the gradient (up to some tolerance). This is called a numerical gradient check.
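For reference, such a numerical check might look roughly like the following sketch, using central differences (the function and its parameters are illustrative and not part of the homework API):

```python
import torch

def numerical_grad_check(loss_fn, w: torch.Tensor, analytic_grad: torch.Tensor,
                         eps: float = 1e-4, tol: float = 1e-3, n_checks: int = 20):
    """Compare analytic_grad to central-difference estimates at random entries of w."""
    flat = w.view(-1)  # shares storage with w, so edits below perturb w itself
    for i in torch.randint(flat.numel(), (n_checks,)):
        orig = flat[i].item()
        flat[i] = orig + eps
        loss_plus = loss_fn(w).item()
        flat[i] = orig - eps
        loss_minus = loss_fn(w).item()
        flat[i] = orig  # restore the perturbed weight
        numeric = (loss_plus - loss_minus) / (2 * eps)
        assert abs(numeric - analytic_grad.view(-1)[i].item()) < tol, \
            f"gradient mismatch at flat index {i}"
```

Central differences have $O(\epsilon^2)$ error, so they are preferable to the one-sided formula; it's also best to run such checks in double precision so that round-off does not dominate the estimate.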
Here we'll use a different approach, just to get a taste of the concept of automatic differentiation, which we'll rely on heavily in the rest of the course.
In the simple linear model we worked with, the gradient was fairly straightforward to derive analytically and implement. However for complex models such as deep neural networks with many layers and non-linear operations between them this is not the case. Additionally, the gradient must be re-derived any time either the model architecture or the loss function changes. These things make it infeasible in practice to perform deep-learning research using this manual method of gradient derivation. Therefore, all deep-learning frameworks provide a mechanism of automatic differentiation, to prevent the user from needing to manually derive the gradients of loss functions.
PyTorch provides this functionality in a package named torch.autograd which we will use further on in the
next exercises.
For now, here's an example showing that autograd can compute the gradient of the loss function you've implemented.
TODO Run the following code block. Try to understand how autograd is used and why. If the test fails, go back and fix your gradient calculation.
from hw1.losses import SVMHingeLoss
# Create a new classifier and loss function
lin_cls = hw1linear.LinearClassifier(n_features, n_classes)
loss_fn = SVMHingeLoss(delta=1)
# Specify we want the gradient to be saved for the weights tensor
# (just for our test)
lin_cls.weights.requires_grad = True
# Forward pass using the weights tensor, calculations will be tracked
y_pred, x_scores = lin_cls.predict(x)
# Compute loss of predictions and their analytic gradient
loss = loss_fn(x, y, x_scores, y_pred)
grad = loss_fn.grad()
# Compute gradient with autograd
autograd_grad = torch.autograd.grad(loss, lin_cls.weights)[0]
# Calculate the difference between analytic and autograd
diff = torch.norm(grad - autograd_grad).item()
print('loss =', loss.item())
print('grad =\n', grad)
print('autograd =\n', autograd_grad)
print('diff =', diff)
test.assertLess(diff, 1e-3, "Gradient diff was too large")
loss = 8.961071968078613
grad =
tensor([[ 0.1500, -0.2600, -0.1600, ..., 0.0100, 0.1100, 0.0600],
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255],
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255],
...,
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255],
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255],
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255]])
autograd =
tensor([[ 0.1500, -0.2600, -0.1600, ..., 0.0100, 0.1100, 0.0600],
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255],
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255],
...,
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255],
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255],
[-0.0636, 0.1103, 0.0679, ..., -0.0042, -0.0467, -0.0255]])
diff = 2.1328307411749847e-05
Generally, solving a machine-learning problem requires defining the following components:
Now that we have implemented our loss function and its gradient, we can finally train our model.
Implementation notes:
PyTorch provides very effective mechanisms to implement all of these components in a decoupled manner.
Note that the regularization strength is passed to the optimizer via its weight_decay parameter. The reason is that we prefer that the part of the loss which only depends on the model parameters be part of the optimizer, not the loss function (though both ways are possible). You'll see this pattern later on when you use PyTorch's optimizers in the torch.optim package.
TODO Implement the LinearClassifier's train() method. Use mini-batch SGD for the weight update rule. Then set your chosen hyperparameters in the hyperparams function. You should play with the hyperparameters to get a feel for what they do to the loss and accuracy graphs.
hp = hw1linear.hyperparams()
print('hyperparams =', hp)
lin_cls = hw1linear.LinearClassifier(n_features, n_classes, weight_std=hp['weight_std'])
# Evaluate on the test set
x_test, y_test = dl_utils.flatten(dl_test)
x_train, y_train = dl_utils.flatten(dl_train)
y_test_pred , _= lin_cls.predict(x_test)
test_acc_before = lin_cls.evaluate_accuracy(y_test, y_test_pred)
# Train the model
svm_loss_fn = SVMHingeLoss()
train_res, valid_res = lin_cls.train(dl_train, dl_valid, svm_loss_fn,
learn_rate=hp['learn_rate'], weight_decay=hp['weight_decay'],
max_epochs=30)
# Re-evaluate on the test set
y_test_pred , _= lin_cls.predict(x_test)
test_acc_after = lin_cls.evaluate_accuracy(y_test, y_test_pred)
# Plot loss and accuracy
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,5))
for i, loss_acc in enumerate(('loss', 'accuracy')):
axes[i].plot(getattr(train_res, loss_acc))
axes[i].plot(getattr(valid_res, loss_acc))
axes[i].set_title(loss_acc.capitalize(), fontweight='bold')
axes[i].set_xlabel('Epoch')
axes[i].legend(('train', 'valid'))
axes[i].grid(which='both', axis='y')
# Check test set accuracy
print(f'Test-set accuracy before training: {test_acc_before:.1f}%')
print(f'Test-set accuracy after training: {test_acc_after:.1f}%')
print(f'Validation-set accuracy after training: {valid_res.accuracy[-1]:.1f}%')
print(f'Train-set accuracy after training: {train_res.accuracy[-1]:.1f}%')
test.assertGreaterEqual(test_acc_after, 85.0)
hyperparams = {'weight_std': 0.001, 'learn_rate': 0.02, 'weight_decay': 0.1}
Training..............................
Test-set accuracy before training: 13.5%
Test-set accuracy after training: 89.0%
Validation-set accuracy after training: 90.3%
Train-set accuracy after training: 93.0%
Even though this is a very naïve model, you should get at least 85% test set accuracy if you implemented training correctly. You can try to change the hyperparameters and see whether you get better results. Generally this should be done with cross-validation.
One way to understand what models learn is to try to visualize their learned parameters. There can be many ways to do this. Let's try a very simple one, which is to reshape them into images of the input size and see what they look like.
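A possible sketch of this reshaping (assuming, as in the bias trick used above, that the bias is the last row of the weight matrix; this is not necessarily the reference implementation):

```python
import torch

def weights_as_images_sketch(weights: torch.Tensor, img_shape=(1, 28, 28)) -> torch.Tensor:
    """Reshape each class's weight column into an image; returns (C, *img_shape)."""
    w = weights[:-1, :]                   # drop the bias row (appended last here)
    return w.t().reshape(-1, *img_shape)  # one image per class
```

Each resulting image can then be displayed like a regular input sample; bright regions correspond to pixels that raise that class's score.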
TODO Implement the weights_as_images() function in the LinearClassifier class.
import cs236781.plot as plot
w_images = lin_cls.weights_as_images(img_shape=(1,28,28))
fig, axes = plot.tensors_as_images(list(w_images))
Additionally, we can better understand the model by plotting some samples and looking at wrong predictions. Run the following block to visualize some test-set examples and the model's predictions for them.
# Plot some images from the test set and their predictions
n_plot = 104
x_test, y_test = next(iter(dl_test))
x_test = x_test[0:n_plot]
y_test = y_test[0:n_plot]
y_test_pred, _ = lin_cls.predict(x_test)
x_test_img = torch.reshape(x_test[:, :-1], (n_plot, 1, 28, 28))
fig, axes = plot.tensors_as_images(list(x_test_img), titles=y_test_pred.numpy(),
nrows=8, hspace=0.5, figsize=(10,8), cmap='gray')
# Highlight the wrong predictions
wrong_pred = y_test_pred != y_test
wrong_pred_axes = axes.ravel()[wrong_pred.numpy().astype(bool)]
for ax in wrong_pred_axes:
ax.title.set_color('red')
ax.title.set_fontweight('bold')
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw1/answers.py.
from cs236781.answers import display_answer
import hw1.answers
Explain why the selection of $\Delta > 0$ is arbitrary for the SVM loss $L(\mat{W})$ as it is defined above (the full in-sample loss, with the regularization term).
display_answer(hw1.answers.part3_q1)
Your answer: $\Delta$ is arbitrary because of the regularization term, whose strength $\lambda$ we can tune. While $\Delta$ specifies the margin size, $\lambda$ determines the "cost" of a sample being inside the margin. For a given dataset we can either change $\Delta$ with $\lambda$ fixed, or change $\lambda$ with $\Delta$ fixed, and there is a correspondence between the two cases (for each value of $\Delta$ with fixed $\lambda$ there is a value of $\lambda$ with fixed $\Delta$ that yields the same accuracy). In our case $\lambda$ is the hyperparameter we tune, so $\Delta$ can be chosen arbitrarily.
Given the images in the visualization section above,
display_answer(hw1.answers.part3_q2)
Your answer:
1. The linear classifier learns, for each class, a weight image that roughly takes the shape of the corresponding class label. This representation is based on the training set, so if the test set contains an image which is very different from the representation of its label but closer to a different class's representation, the model will misclassify the image as the latter class. We can see this happened between the digits 7 and 2 (these are similar digits, so the probability of such errors is higher for them). 2. In KNN there is no actual learning; classification is based only on neighboring samples. In the linear classification model, neighboring samples have no effect: the model learns the characteristic features of each label and classifies based on the similarity between a sample's features and the learned features.
Is the learning rate good, too high, or too low, and why? Explain your answer by describing what the loss graph would look like in the other two cases when training for the same number of epochs.
display_answer(hw1.answers.part3_q3)
Your answer:
With a too-high learning rate, the loss would oscillate or even diverge and would not decrease. With a too-low learning rate, the loss would decrease very slowly and we would not reach saturation (a plateau in the loss and accuracy). Since in our graphs we see a steep decrease (or increase, in the case of accuracy) followed by a flatter region, we conclude that the learning rate is good.
$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial {#1}}{\partial {#2}}} $
In this part we'll perform the classic machine learning task of linear regression.
We'll do some simple data exploration and feature engineering,
like in the pre-deep-learning days.
Our solution will be implemented using some very widely used machine-learning python libraries
(numpy,
scikit-learn and
pandas).
We'll then explore the generalization capacity of the model and perform cross validation.
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import unittest
%matplotlib inline
%load_ext autoreload
%autoreload 2
plt.rcParams.update({'font.size': 14})
np.random.seed(42)
test = unittest.TestCase()
We'll be working with the Boston housing dataset. This is a famous toy dataset for benchmarking regression algorithms.
The dataset contains 506 samples of median house values in Boston, each with 13 associated house and neighborhood attributes (features; see link for their meaning).
The 13 features of each house are our independent variables, and we're trying to predict the value of MEDV, the median house price (in units of $1000).
Run the following block to load the data. Since this dataset is very small, we can load it directly into memory and forgo any lazy-loading mechanisms.
import sklearn.datasets
import warnings
# Load data we'll work with - Boston housing dataset
# We'll use sklearn's built-in data
with warnings.catch_warnings():
warnings.simplefilter("ignore")
ds_boston = sklearn.datasets.load_boston()
feature_names = list(ds_boston.feature_names)
n_features = len(feature_names)
x, y = ds_boston.data, ds_boston.target
n_samples = len(y)
print(f'Loaded {n_samples} samples')
Loaded 506 samples
Let's use pandas to visualize the independent and target variables.
We'll just show the first 10 samples.
# Load into a pandas dataframe and show some samples
df_boston = pd.DataFrame(data=x, columns=ds_boston.feature_names)
df_boston = df_boston.assign(MEDV=y)
df_boston.head(10).style.background_gradient(subset=['MEDV'], high=1.)
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.006320 | 18.000000 | 2.310000 | 0.000000 | 0.538000 | 6.575000 | 65.200000 | 4.090000 | 1.000000 | 296.000000 | 15.300000 | 396.900000 | 4.980000 | 24.000000 |
| 1 | 0.027310 | 0.000000 | 7.070000 | 0.000000 | 0.469000 | 6.421000 | 78.900000 | 4.967100 | 2.000000 | 242.000000 | 17.800000 | 396.900000 | 9.140000 | 21.600000 |
| 2 | 0.027290 | 0.000000 | 7.070000 | 0.000000 | 0.469000 | 7.185000 | 61.100000 | 4.967100 | 2.000000 | 242.000000 | 17.800000 | 392.830000 | 4.030000 | 34.700000 |
| 3 | 0.032370 | 0.000000 | 2.180000 | 0.000000 | 0.458000 | 6.998000 | 45.800000 | 6.062200 | 3.000000 | 222.000000 | 18.700000 | 394.630000 | 2.940000 | 33.400000 |
| 4 | 0.069050 | 0.000000 | 2.180000 | 0.000000 | 0.458000 | 7.147000 | 54.200000 | 6.062200 | 3.000000 | 222.000000 | 18.700000 | 396.900000 | 5.330000 | 36.200000 |
| 5 | 0.029850 | 0.000000 | 2.180000 | 0.000000 | 0.458000 | 6.430000 | 58.700000 | 6.062200 | 3.000000 | 222.000000 | 18.700000 | 394.120000 | 5.210000 | 28.700000 |
| 6 | 0.088290 | 12.500000 | 7.870000 | 0.000000 | 0.524000 | 6.012000 | 66.600000 | 5.560500 | 5.000000 | 311.000000 | 15.200000 | 395.600000 | 12.430000 | 22.900000 |
| 7 | 0.144550 | 12.500000 | 7.870000 | 0.000000 | 0.524000 | 6.172000 | 96.100000 | 5.950500 | 5.000000 | 311.000000 | 15.200000 | 396.900000 | 19.150000 | 27.100000 |
| 8 | 0.211240 | 12.500000 | 7.870000 | 0.000000 | 0.524000 | 5.631000 | 100.000000 | 6.082100 | 5.000000 | 311.000000 | 15.200000 | 386.630000 | 29.930000 | 16.500000 |
| 9 | 0.170040 | 12.500000 | 7.870000 | 0.000000 | 0.524000 | 6.004000 | 85.900000 | 6.592100 | 5.000000 | 311.000000 | 15.200000 | 386.710000 | 17.100000 | 18.900000 |
Let's explore the data a bit by plotting a scatter matrix of every variable as a function of every other and a histogram for each.
pd.plotting.scatter_matrix(df_boston, figsize=(20,20));
The above chart shows us (among other things) how our target variable MEDV behaves as a function
of the features (bottom row). By looking at it, can you guess which relationships might be good candidates for a linear model?
Let's use a simple method for deciding which features to use for our linear model: the correlation coefficient, defined as
$$ \rho_{\vec{x}\vec{y}} = \frac{\sigma_{\vec{x}\vec{y}}}{\sigma_{\vec{x}} \sigma_{\vec{y}}} = \frac {\sum_{i=1}^{N} (x_i - \mu_\vec{x}) (y_i - \mu_\vec{y}) } {\sqrt{\sum_{i=1}^{N} (x_i - \mu_\vec{x})^2} \cdot \sqrt{\sum_{i=1}^{N} (y_i - \mu_\vec{y})^2}} $$where $\vec{x}, \vec{y}$ are $N$ samples of two variables, $\mu$ denotes a sample mean, $\sigma_{\vec{x}}, \sigma_{\vec{y}}$ are the sample standard deviations and $\sigma_{\vec{x}\vec{y}}$ is the sample covariance. The value of $\rho$ is $\pm 1$ for a perfect positive or negative linear relationship ($y=ax+b$), and somewhere in between when the relationship is not perfectly linear. Note that this coefficient is rather limited: even when $\rho=0$, the variables may be highly dependent, just not in a linear fashion.
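As an illustration only (the homework asks for your own top_correlated_features implementation), this selection can be sketched with pandas' built-in corr():

```python
import pandas as pd

def top_correlated_sketch(df: pd.DataFrame, target: str, n: int):
    """Return the n feature names most correlated (in absolute value) with target."""
    corr = df.corr()[target].drop(target)  # Pearson correlation with the target column
    order = corr.abs().sort_values(ascending=False).index
    top = corr.reindex(order)[:n]
    return list(top.index), list(top.values)
```

Sorting by $|\rho|$ rather than $\rho$ matters: a strong negative correlation is just as useful for a linear model as a strong positive one.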
Let's implement this method to find out which features we should include in our initial linear model.
TODO Implement the top_correlated_features() function in the hw1/linear_regression.py module.
import hw1.linear_regression as hw1linreg
n_top_features = 5
top_feature_names, top_corr = hw1linreg.top_correlated_features(df_boston, 'MEDV', n_top_features)
print('Top features: ', top_feature_names)
print('Top features correlations: ', top_corr)
# Tests
test.assertEqual(len(top_feature_names), n_top_features)
test.assertEqual(len(top_corr), n_top_features)
test.assertAlmostEqual(np.sum(np.abs(top_corr)), 2.893, delta=1e-3) # compare to precomputed value for n=5
Top features: ['LSTAT', 'RM', 'PTRATIO', 'INDUS', 'TAX']
Top features correlations: [-0.7376627261740146, 0.6953599470715393, -0.5077866855375619, -0.4837251600283728, -0.4685359335677669]
Arguably the simplest machine learning model is linear regression. We are given a dataset $\left\{\vec{x}^{(i)}, y^{(i)}\right\}_{i=1}^{N}$ where $\vec{x}^{(i)} \in \set{R}^D$ is a $D$-dimensional feature vector and $y^{(i)}\in\set{R}$ is a continuous quantity assumed to be the output of some unknown function, i.e. $y^{(i)} = f(\vec{x}^{(i)})$.
Our goal will be to fit a linear transformation, parametrized by weights vector and bias term $\vec{w}, b$, such that given a sample $\vec{x}$ our prediction is
$$ \hat{y} = \vectr{w}\vec{x} + b. $$We'll judge the performance of the model in the ordinary least-squares sense, i.e. with a loss function given by the mean-squared error (MSE) with the addition of an L2-regularization term: $$ L(\vec{w}) = \frac{1}{2N} \sum_{i=1}^{N} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \frac{\lambda}{2}\norm{\vec{w}}^2_2 = \frac{1}{2N} \sum_{i=1}^{N} \left( y^{(i)} - \vectr{w}\vec{x}^{(i)} - b \right)^2 + \frac{\lambda}{2}\norm{\vec{w}}^2_2. $$
Minimizing the above $L(\vec{w})$ is a simple convex optimization problem with a closed-form solution. Of course, this can also be solved using iterative descent methods which are necessary when the data is too large to fit in memory.
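For reference, with the bias absorbed into $\vec{w}$ via the bias trick and $\mat{X}$ stacking the samples as rows, setting the gradient $-\frac{1}{N}\mattr{X}\left(\vec{y} - \mat{X}\vec{w}\right) + \lambda\vec{w}$ to zero yields the closed-form solution $\vec{w} = \left(\mattr{X}\mat{X} + N\lambda \mat{I}\right)^{-1}\mattr{X}\vec{y}$. A minimal numpy sketch (note: for simplicity it regularizes every entry of $\vec{w}$, including the bias, unlike the loss stated above):

```python
import numpy as np

def ridge_closed_form(X: np.ndarray, y: np.ndarray, reg_lambda: float) -> np.ndarray:
    """Closed-form minimizer of (1/2N)||y - Xw||^2 + (lambda/2)||w||^2."""
    n, d = X.shape
    A = X.T @ X + n * reg_lambda * np.eye(d)  # from setting the gradient to zero
    return np.linalg.solve(A, X.T @ y)        # solve A w = X^T y (more stable than inverting A)
```

Using np.linalg.solve instead of explicitly inverting $\mattr{X}\mat{X} + N\lambda\mat{I}$ is both faster and numerically safer.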
As a warm-up with numpy, let's implement the bias trick again (this time using numpy, wrapped as an sklearn transformation)
so that our linear regression model will operate on data with an added bias term.
TODO Implement the class BiasTrickTransformer in the hw1/linear_regression.py module.
# Test BiasTrickTransformer
bias_tf = hw1linreg.BiasTrickTransformer()
test_cases = [
np.random.randint(10, 20, size=(5,2)),
np.random.randn(10, 1),
]
for xt in test_cases:
xb = bias_tf.fit_transform(xt)
print(xb.shape)
test.assertEqual(xb.ndim, 2)
test.assertTrue(np.all(xb[:,0] == 1))
test.assertTrue(np.all(xb[:, 1:] == xt))
(5, 3)
(10, 2)
Let's now define a function to assess the accuracy of our model's predictions (loss and score). We'll use the MSE loss as above and $R^2$ as a score. Note that $R^2$ is at most $1$ (typically in $[0, 1]$ for a reasonable fit) and represents how much better the regression fits the data compared to simply predicting the data's mean. It is given by $$ R^2 = 1-\frac{\sum_i (e^{(i)})^2}{\sum_i (y^{(i)} - \bar{y})^2}, $$ where $e^{(i)} = y^{(i)} - \hat{y}^{(i)}$ is known as the residual for each sample $i$ and $\bar{y}$ is the data mean.
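As a sketch of these two measures directly from their definitions (illustrative only; the homework requires mse_score and r2_score in hw1/linear_regression.py):

```python
import numpy as np

def mse_sketch(y: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error between targets and predictions."""
    return float(np.mean((y - y_pred) ** 2))

def r2_sketch(y: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_pred) ** 2)      # sum of squared residuals
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total variation around the mean
    return float(1.0 - ss_res / ss_tot)
```

When the model is worse than predicting $\bar{y}$, $ss\_res > ss\_tot$ and this formula goes negative, which is why sklearn's r2_score can return negative values for random predictions.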
TODO Implement the mse_score and r2_score function in the hw1/linear_regression.py module.
def evaluate_accuracy(y: np.ndarray, y_pred: np.ndarray):
"""
Calculates mean squared error (MSE) and coefficient of determination (R-squared).
:param y: Target values.
:param y_pred: Predicted values.
:return: A tuple containing the MSE and R-squared values.
"""
mse = hw1linreg.mse_score(y, y_pred)
rsq = hw1linreg.r2_score(y, y_pred)
return mse, rsq
Of course, these measures and many others are built into sklearn. We'll use them to test our implementation.
from sklearn.metrics import r2_score as r2, mean_squared_error as mse
for i in range(10):
test_y = np.random.randn(20)
test_y_pred = np.random.randn(20)
mse_actual, r2_actual = evaluate_accuracy(test_y, test_y_pred)
mse_expected, r2_expected = mse(test_y, test_y_pred), r2(test_y, test_y_pred)
test.assertAlmostEqual(mse_actual, mse_expected, delta=1e-6)
test.assertAlmostEqual(r2_actual, r2_expected, delta=1e-6)
Now we can implement our model.
TODO Based on the above equations for the model and loss, implement the predict() and fit()
functions in the LinearRegressor class within the module linear_regression.py.
You'll need to first derive the closed-form solution for the optimal $\vec{w}$ based on the loss.
Run the code block below to fit your model to each of the 5 top
features you selected (one at a time).
A very useful feature of sklearn is pipelines: We can create a composite model made of multiple
steps which transform the features (using fit_transform) and a final step which calculates the actual
model predictions (using fit_predict()). Each step in the pipeline should be an sklearn Estimator
instance and implement the appropriate methods.
For example, lets create a pipeline that scales each input feature to zero-mean and unit variance, applies our bias-trick transformation and finally uses our Linear Regression model.
import sklearn.preprocessing
import sklearn.pipeline
# Create our model as a pipeline:
# First we scale each feature, then the bias trick is applied, then the regressor
model = sklearn.pipeline.make_pipeline(
sklearn.preprocessing.StandardScaler(),
hw1linreg.BiasTrickTransformer(),
hw1linreg.LinearRegressor(),
)
# Test the model implementation is correct
y_pred = model.fit_predict(x, y)
full_dataset_mse, _ = evaluate_accuracy(y, y_pred)
test.assertEqual(y_pred.shape, y.shape)
test.assertAlmostEqual(full_dataset_mse, 22.660, delta=1e-1)
From here on we'll use our pipeline as a model.
We want to now check the predictive power of different features. First, we'll implement a small helper function that will allow us to fit a model on a subset of features from our dataframe.
TODO Implement the fit_predict_dataframe function in the linear_regression.py module.
# Full dataset
y_pred = hw1linreg.fit_predict_dataframe(
model, df_boston, target_name='MEDV'
)
test.assertAlmostEqual(full_dataset_mse, evaluate_accuracy(y,y_pred)[0], delta=1e-1)
# Subset of features
y_pred = hw1linreg.fit_predict_dataframe(
model, df_boston, target_name='MEDV', feature_names=['CHAS', 'B']
)
test.assertAlmostEqual(72.982, evaluate_accuracy(y,y_pred)[0], delta=1e-1)
We'll use each feature separately and fit multiple times to get an idea of the predictive power of each of our top-5 features.
fig, ax = plt.subplots(nrows=1, ncols=n_top_features, sharey=True, figsize=(20,5))
actual_mse = []
# Fit a single feature at a time
for i, feature_name in enumerate(top_feature_names):
y_pred = hw1linreg.fit_predict_dataframe(model, df_boston, 'MEDV', [feature_name])
mse, rsq = evaluate_accuracy(y, y_pred)
# Plot
xf = df_boston[feature_name].values.reshape(-1, 1)
x_line = np.arange(xf.min(), xf.max(), 0.1, dtype=float).reshape(-1, 1)
y_line = model.predict(x_line)
ax[i].scatter(xf, y, marker='o', edgecolor='black')
ax[i].plot(x_line, y_line, color='red', lw=2, label=f'fit, $R^2={rsq:.2f}$')
ax[i].set_ylabel('MEDV')
ax[i].set_xlabel(feature_name)
ax[i].legend(loc='upper right')
actual_mse.append(mse)
# Test regressor implementation
print(actual_mse)
expected_mse = [38.862, 43.937, 62.832, 64.829, 66.040]
for i in range(len(expected_mse)):
test.assertAlmostEqual(expected_mse[i], actual_mse[i], delta=1e-1)
[38.86260846068978, 43.93789891484722, 62.83209551907833, 64.82947233954711, 66.0404346824534]
As you can see, the results are not great. We can't reliably predict the target variable based on just one of these. Now let's fit a model based on the combined top-5 features. Since it's difficult to visualize high-dimensional hyperplanes, instead of plotting the data and fitted hyperplane, we'll create a residuals plot. This is the plot of the error, or residual $e^{(i)} = y^{(i)} - \hat{y}^{(i)}$ vs. the predicted value $\hat{y}^{(i)}$.
# Fit top-5 features
y_pred = hw1linreg.fit_predict_dataframe(model, df_boston, 'MEDV', top_feature_names)
mse5, rsq5 = evaluate_accuracy(y, y_pred)
print(f'mse5={mse5:.2f}, rsq5={rsq5:.2f}')
# Residuals plot
def plot_residuals(y, y_pred, ax=None, res_label=None):
if ax is None:
_, ax = plt.subplots()
res = y - y_pred
ax.scatter(y_pred, res, marker='s', edgecolor='black', label=res_label)
ax.hlines(y=0, xmin=y_pred.min(), xmax=y_pred.max(), color='red', lw=3)
ax.hlines(y=[-res.std(), res.std()], xmin=y_pred.min(), xmax=y_pred.max(), color='red', lw=3, linestyles=':')
ax.set_xlabel(r'$\hat{y}$')
ax.set_ylabel(r'$y - \hat{y}$')
if res_label is not None:
ax.legend()
return ax
plot_residuals(y, y_pred)
# Sanity test
test.assertLess(mse5, 30)
mse5=27.20, rsq5=0.68
That's better, but there's still more to be desired. Let's try to improve our model further.
We can see from the scatter matrix that some of the relationships between our features and the target variable are clearly not linear and cannot be fully modeled by fitting lines (or hyperplanes). Is there a way to fit a non-linear function to the data (such as a polynomial) while still using the simplicity of the linear regression model?
Suppose we have 2-dimensional feature vectors, $\vec{x}=(x_1, x_2)$. We can fit a linear regression model with 3 parameters which represents some 2-d plane. However if we transform each such feature vector, for example by $\vec{\tilde{x}} = (x_1, x_2, x_1^2, x_1 x_2, x_2^2)$, then we can now fit a model with 6 parameters to the same data. We can thus increase the capacity of our model (its ability to fit a wide variety of functions) by adding more parameters that correspond to non-linear transformations of the features.
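For example, sklearn's PolynomialFeatures (see also the notes below) performs exactly this kind of expansion; a small sketch of the transformation from the example above:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Expand each (x1, x2) into (x1, x2, x1^2, x1*x2, x2^2)
X = np.array([[1., 2.],
              [3., 4.]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(X_poly)  # first row: [1. 2. 1. 2. 4.]
```

A downstream linear regressor fit on X_poly is still linear in its parameters, but represents a quadratic function of the original features.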
Let's implement some hand-crafted nonlinear features based on all the features in the dataset. This step in the machine learning process is sometimes also referred to as feature engineering. In the rest of the course, you'll see how Deep Learning allows us to learn the features themselves instead of creating them by hand, and thus creating very powerful representations.
TODO Implement the BostonFeaturesTransformer class in the hw1/linear_regression.py module.
You can create any features you want, for example given $\vec{x}=(x_1,x_2)$ you could generate features
such as $x_1^2$, $x_1 \log{x_2}$, $e^{-x_1}$ and so on.
Try to "engineer" features by inferring relationships based on the scatter matrix.
Notes:
You can use PolynomialFeatures from sklearn.preprocessing
to simplify generation of polynomial features.
def linreg_boston(model, x, y, fit=True):
if fit:
model.fit(x, y)
y_pred = model.predict(x)
mse, rsq = evaluate_accuracy(y, y_pred)
return y_pred, mse, rsq
# Fit with all features this time
x = df_boston[feature_names].values
# Use model with a custom features transform
model = sklearn.pipeline.make_pipeline(
hw1linreg.BiasTrickTransformer(),
hw1linreg.BostonFeaturesTransformer(),
hw1linreg.LinearRegressor()
)
y_pred, mse, rsq = linreg_boston(model, x, y)
plot_residuals(y, y_pred)
# Test: You should get at least 2x lower loss than previously, easily even lower
print(f'target_mse={mse5/2:.3f}')
print(f'mse={mse:.2f}, rsq={rsq:.2f}')
test.assertLess(mse, mse5 / 2)
target_mse=13.601
mse=8.25, rsq=0.90
By now, your model should produce fairly accurate predictions. Note however that we trained it on the entire Boston dataset.
When training models, we don't actually care about their performance on the training data; we're not interested in solving optimization problems. What we want is the ability to generalize: How well will it perform on novel, unseen data? In other words, did the model learn some function similar to the one actually generating the samples?
Let's find out how good our model is for unseen data the usual way: We'll split our dataset into a training and test set.
from sklearn.model_selection import train_test_split
# Data and model
x = df_boston[feature_names].values
y = df_boston['MEDV'].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
model = sklearn.pipeline.make_pipeline(
hw1linreg.BiasTrickTransformer(),
hw1linreg.BostonFeaturesTransformer(),
hw1linreg.LinearRegressor()
)
However, instead of just fitting the model on the training set and evaluating on the test set, we'll use cross-validation to find a set of model hyperparameters that allow the model to generalize well.
We'll again use k-fold CV: the training set is split into k folds, and for each candidate set of hyperparameters, each fold in turn is treated as a validation set while the model is fitted to the remaining folds. However, this time we have more hyperparameters to test.
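The CV loop described above can be sketched as follows. This is a minimal standalone illustration, not the required `cv_best_hyperparams` implementation: `cv_best_params_sketch`, the toy data, and the use of a `Ridge` model as a stand-in for the pipeline are all made up for the example.

```python
import numpy as np
from itertools import product
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge  # stand-in for the assignment's pipeline

def cv_best_params_sketch(model, X, y, k_folds, param_grid):
    """Return the parameter combination with the lowest mean validation MSE.
    param_grid maps parameter name -> iterable of candidate values."""
    kf = KFold(n_splits=k_folds, shuffle=True, random_state=0)
    names = list(param_grid.keys())
    best_params, best_mse = None, np.inf
    # Try every combination of candidate values (a full grid search)
    for combo in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        model.set_params(**params)
        fold_mse = []
        # Each fold serves once as the validation set
        for tr_idx, va_idx in kf.split(X):
            model.fit(X[tr_idx], y[tr_idx])
            resid = y[va_idx] - model.predict(X[va_idx])
            fold_mse.append(np.mean(resid ** 2))
        mean_mse = np.mean(fold_mse)
        if mean_mse < best_mse:
            best_params, best_mse = params, mean_mse
    return best_params

# Toy usage: find the best Ridge alpha on synthetic linear data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
best = cv_best_params_sketch(Ridge(), X, y, k_folds=3,
                             param_grid={'alpha': np.logspace(-3, 2, 6)})
print(best)
```

Since the data here is nearly noise-free and truly linear, a small regularization strength should win.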
TODO Implement the cv_best_hyperparams() function in the hw1/linear_regression.py module.
# Define search-spaces for hyper parameters
degree_range = np.arange(1, 4)
lambda_range = np.logspace(-3, 2, base=10, num=20)
# Use cross-validation to find best combination of hyperparameters
best_hypers = hw1linreg.cv_best_hyperparams(
model, x_train, y_train, k_folds=3,
degree_range=degree_range, lambda_range=lambda_range
)
print('Best hyperparameters: ', best_hypers)
# Make sure returned params exist in the model
for param in best_hypers.keys():
test.assertIn(param, model.get_params())
Best hyperparameters: {'linearregressor__reg_lambda': 0.23357214690901212, 'bostonfeaturestransformer__degree': 2, 'bostonfeaturestransformer__INDUS_threshold': 7, 'bostonfeaturestransformer__TAX_threshold': 200}
Now let's use the best hyperparameters to train a model on the training set and evaluate it on the test set.
# Use the best hyperparameters
model.set_params(**best_hypers)
# Train best model on full training set
y_pred_train, mse, rsq = linreg_boston(model, x_train, y_train)
print(f'train: mse={mse:.2f}, rsq={rsq:.2f}')
ax = plot_residuals(y_train, y_pred_train, res_label='train')
# Evaluate on test set
y_pred_test, mse, rsq = linreg_boston(model, x_test, y_test, fit=False)
print(f'test: mse={mse:.2f}, rsq={rsq:.2f}')
ax = plot_residuals(y_test, y_pred_test, ax=ax, res_label='test')
# Make sure test-set accuracy is good
test.assertLess(mse, 20) # You should be able to get way below this
train: mse=7.12, rsq=0.92
test: mse=16.58, rsq=0.76
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw1/answers.py.
from cs236781.answers import display_answer
import hw1.answers
What's the ideal pattern to see in a residual plot? Based on the residual plots you got above, what can you say about the fitness of the trained model? Compare the plot for the top-5 features with the final plot after CV.
display_answer(hw1.answers.part4_q1)
Your answer:
The ideal pattern in a residual plot is points scattered randomly and symmetrically around the horizontal zero line, with small spread and no visible structure.
A graph of the residuals $y-\hat{y}$ as a function of $\hat{y}$ makes it possible to see the error itself directly and clearly.
Conversely, a graph of $\hat{y}$ as a function of $y$ contains more information: it shows the prediction model itself, allowing it to be compared to the desired values $y$. This comparison helps us understand where the error comes from. For example, we can see whether $y$ relates linearly to the values of $X$, which tells us whether the problem can be solved by mapping to another feature space, and which feature space to use (e.g. a polynomial mapping). Alternatively, the relationship between $X$ and $y$ may indeed be linear while our model is far from the desired prediction because the slope is wrong, in which case the regularization may be too strong and should be weakened.
Another difference is that a graph of $\hat{y}$ as a function of $y$ can only be displayed for one feature at a time, or at most 2 features (in a 3D graph).
In summary, the graph of $y-\hat{y}$ as a function of $\hat{y}$ makes it possible to clearly see the error itself, while a graph of $\hat{y}$ as a function of $y$ makes it easier to see the cause of the error but harder to see the error itself. Therefore, while still tuning the model's hyperparameters we would prefer the graph of $\hat{y}$ as a function of $y$; once training is finished and we only want to assess accuracy, we would prefer the graph of $y-\hat{y}$ as a function of $\hat{y}$.
In our plots above, we see that the mean residual on the test set is 0, just like on the training set, and that the spread of the training residuals (the dotted std lines) is large compared to the gap between the training and test residual spreads. The conclusion is that most of the error is due to bias rather than variance; that is, the model is more susceptible to underfitting than to overfitting.
This is a conclusion that can be easily read off this plot pattern, but not from the alternative pattern.
Explain the effect of adding non-linear features to our data.
display_answer(hw1.answers.part4_q2)
The model is no longer a linear regression model in the original feature space, because it does not find a linear relationship between $X$ and $y$. It does use linear regression as part of the model, but that use is inside a "black box": the model accepts inputs $X$ and returns predictions $\hat{y}$ that are not linearly related to them.
In principle, we can fit any function that can be written with a finite number of such feature terms. In practice, the more parameters the function requires, the harder it is to fit without causing overfitting. Therefore we will prefer simpler functions.
Geometrically, the fitted model is no longer a flat hyperplane but a curved hypersurface in the original feature space, even though it is still a hyperplane in the transformed feature space.
Regarding the cross-validation:
Why did we use np.logspace instead of np.linspace? Explain the advantage for CV.
display_answer(hw1.answers.part4_q3)
Your answer:
The cross-validation process takes a long time. Therefore, we prefer to shorten the search by sampling the search space logarithmically: np.logspace covers several orders of magnitude of $\lambda$ with only a few candidate values, whereas np.linspace over the same range would concentrate nearly all of its points at the large end of the range and barely explore the small values.
I adjusted each of the hyperparameters separately, so the total number of fit operations is the sum of the parameter range sizes:
$3+20+10+10+7 = 50$
If I were to do a full grid search, I would get a much higher number: $3 \cdot 20 \cdot 10 \cdot 10 \cdot 7 = 42000$
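As a quick standalone illustration of the logspace-vs-linspace point above (using 5 candidate values here rather than the 20 used in the actual search):

```python
import numpy as np

# Five candidates over the same range [1e-3, 1e2]:
lin = np.linspace(1e-3, 1e2, num=5)       # uniform steps on a linear scale
log = np.logspace(-3, 2, base=10, num=5)  # uniform steps in the exponent

# linspace places almost all candidates at the large end of the range,
# so lambdas between 1e-3 and ~1 are barely explored:
print(lin)
# logspace places roughly one candidate per order of magnitude:
print(log)
```

With logspace, a 20-point grid spans $10^{-3}$ through $10^2$ with several candidates per decade, which is what makes a coarse-but-wide $\lambda$ search feasible inside CV.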